{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "functional-corrections",
   "metadata": {},
   "source": [
    "## K-means "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "109c1cfe",
   "metadata": {},
   "source": [
    "K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points into k - clusters. Goal: to partition a given dataset into k (predefined) clusters.\n",
    "\n",
    "The k-means algorithm works by first randomly initializing k cluster centers, one for each cluster. Each data point in the dataset is then assigned to the nearest cluster center based on their distance. The distance metric used is typically Euclidean distance, but other distance measures such as Manhattan distance or cosine similarity can also be used.\n",
    "\n",
    "After all the data points have been assigned to a cluster, the algorithm calculates the new mean for each cluster by taking the average of all the data points assigned to that cluster. These new means become the new cluster centers. The algorithm then repeats the assignment and mean calculation steps until the cluster assignments no longer change or until a maximum number of iterations is reached.\n",
    "\n",
    "The final output of the k-means algorithm is a set of k clusters, where each cluster contains the data points that are most similar to each other based on the distance metric used. The algorithm is commonly used in various fields such as image segmentation, market segmentation, and customer profiling.\n",
    "\n",
    "\n",
    "```\n",
    "Initialize:\n",
    "- K: number of clusters\n",
    "- Data: the input dataset\n",
    "- Randomly select K initial centroids\n",
    "\n",
    "Repeat:\n",
    "- Assign each data point to the nearest centroid (based on Euclidean distance)\n",
    "- Calculate the mean of each cluster to update its centroid\n",
    "- Check if the centroids have converged (i.e., they no longer change)\n",
    "\n",
    "Until:\n",
    "- The centroids have converged\n",
    "- The maximum number of iterations has been reached\n",
    "\n",
    "Output:\n",
    "- The final K clusters and their corresponding centroids\n",
    "```\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "36cafa73",
   "metadata": {},
   "source": [
    "## Code \n",
    "Here's an implementation of k-means clustering algorithm in Python from scratch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ab3cb277",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "class KMeans:\n",
    "    def __init__(self, k, max_iterations=100):\n",
    "        self.k = k\n",
    "        self.max_iterations = max_iterations\n",
    "        \n",
    "    def fit(self, X):\n",
    "        # Initialize centroids randomly\n",
    "        self.centroids = X[np.random.choice(range(len(X)), self.k, replace=False)]\n",
    "        \n",
    "        for i in range(self.max_iterations):\n",
    "            # Assign each data point to the nearest centroid\n",
    "            cluster_assignments = []\n",
    "            for j in range(len(X)):\n",
    "                distances = np.linalg.norm(X[j] - self.centroids, axis=1)\n",
    "                cluster_assignments.append(np.argmin(distances))\n",
    "            \n",
    "            # Update centroids\n",
    "            for k in range(self.k):\n",
    "                cluster_data_points = X[np.where(np.array(cluster_assignments) == k)]\n",
    "                if len(cluster_data_points) > 0:\n",
    "                    self.centroids[k] = np.mean(cluster_data_points, axis=0)\n",
    "            \n",
    "            # Check for convergence\n",
    "            if i > 0 and np.array_equal(self.centroids, previous_centroids):\n",
    "                break\n",
    "            \n",
    "            # Update previous centroids\n",
    "            previous_centroids = np.copy(self.centroids)\n",
    "        \n",
    "        # Store the final cluster assignments\n",
    "        self.cluster_assignments = cluster_assignments\n",
    "    \n",
    "    def predict(self, X):\n",
    "        # Assign each data point to the nearest centroid\n",
    "        cluster_assignments = []\n",
    "        for j in range(len(X)):\n",
    "            distances = np.linalg.norm(X[j] - self.centroids, axis=1)\n",
    "            cluster_assignments.append(np.argmin(distances))\n",
    "        \n",
    "        return cluster_assignments"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "538027c3",
   "metadata": {},
   "source": [
    "The KMeans class has an __init__ method that takes the number of clusters (k) and the maximum number of iterations to run (max_iterations). The fit method takes the input dataset (X) and runs the k-means clustering algorithm. The predict method takes a new dataset (X) and returns the cluster assignments for each data point based on the centroids learned during training.\n",
    "\n",
    "Note that this implementation assumes that the input dataset X is a NumPy array with each row representing a single data point and each column representing a feature. The algorithm also uses Euclidean distance to calculate the distances between data points and centroids.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1724d308",
   "metadata": {},
   "source": [
    "### Test "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "141e9843",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]\n",
      "[[-5.53443211 -5.13920695]\n",
      " [ 4.46522152  5.04931144]]\n"
     ]
    }
   ],
   "source": [
    "\n",
    "x1 = np.random.randn(5,2) + 5\n",
    "x2 = np.random.randn(5,2) - 5\n",
    "X = np.concatenate([x1,x2], axis=0)\n",
    "\n",
    "# Initialize the KMeans object with k=3\n",
    "kmeans = KMeans(k=2)\n",
    "\n",
    "# Fit the k-means model to the dataset\n",
    "kmeans.fit(X)\n",
    "\n",
    "# Get the cluster assignments for the input dataset\n",
    "cluster_assignments = kmeans.predict(X)\n",
    "\n",
    "# Print the cluster assignments\n",
    "print(cluster_assignments)\n",
    "\n",
    "# Print the learned centroids\n",
    "print(kmeans.centroids)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "04430ff9",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fa0fb8d4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXIAAAD4CAYAAADxeG0DAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAPAUlEQVR4nO3df6hkZ33H8c/n7ir11kjEvRKa3Z1JaFOaaiBlEizBWpMoUZekf/QP7UTS+sfQUEMChjTxQv+7IFrUglIZ0pSCAyForEW0mrRW6B9GZ/PDGjeREPZuNhoysQWlVxKW/faPmdvdvbl379x7njmz33PfL1jmzjNnn/M97O5nnj3Pc85xRAgAkNfCvAsAAFRDkANAcgQ5ACRHkANAcgQ5ACS3fx47PXDgQLTb7XnsGgDSOnr06CsRsbSxfS5B3m63NRwO57FrAEjL9upm7ZxaAYDkCHIASI4gB4DkCHIASI4gB4DkCHIAmKHBQGq3pYWF8etgUH4fc1l+CAB7wWAg9XrS2tr4/erq+L0kdbvl9sOIHABmZHn5TIivW1sbt5dEkAPAjJw4sbP23SLIAWBGDh/eWftuEeQAMCMrK9Li4rlti4vj9pIIcgCYkW5X6velVkuyx6/9ftmJTolVKwAwU91u+eDeqMiI3PbFtr9i+xnbx2z/YYl+AQDbKzUi/ztJ/xoRf2r7jZIWt/sNAIAyKge57bdI+iNJfy5JEfGapNeq9gsAmE6JUyuXSxpJ+kfbT9i+3/ZvbtzIds/20PZwNBoV2C0AQCoT5Psl/YGkv4+IqyX9r6R7N24UEf2I6EREZ2npdU8qAgDsUokgPynpZEQ8Nnn/FY2DHQBQg8pBHhEvSXrB9u9Omm6Q9JOq/QIAplNq1codkgaTFSvPS/qLQv0CALZRZB15RDw5Of99VUT8SUT8T4l+ATRPHffn3mu4shNAbeq6P/dew71WANSmrvtz7zUEOYDa1HV/7r2GIAdQm7ruz73XEOQAalPX/bkvRLOc5CXIAdSmrvtzX2jWJ3lXV6WIM5O8pcKcIAdQq25XOn5cOn16/JotxAeDgdrtthYWFtRutzWYIo1nPcnL8kMAmNJgMFCv19PaJJVXV1fVm6yf7J7nG2nWk7yMyAFgSsvLy/8f4uvW1ta0vM3QetaTvAQ5AEzpxBZD6K3a1816kpcgB4ApHd5iCL1V+7pZT/IS5AAwpZWVFS1uGFovLi5qZYqh9SwneQlyAJhSt9tVv99Xq9WSbbVaLfX7/fNOdNbBEVH7TjudTgyHw9r3CwCZ2T4aEZ2N7YzIASA5ghwAkiPIASA5ghwAkiPIAWBG6nqsHfdaAYAZqPOxdozIAWAG6nysHUEOADNQ52PtigW57X22n7D9jVJ9AkBWdT7WruSI/E5Jxwr2BwBp1flYuyJBbvugpA9Jur9EfwCQ3dl3PJSkffvOnCMvvXql1KqVz0u6R9JFW21guyepJ21/y0cAaIL11SmzXr1SeURu+4iklyPi6Pm2i4h+RHQiorO0tFR1twCQQh2rV0qcWrlO0s22j0t6UNL1tr9coF8ASK+O1SuVgzwi7ouIgxHRlvRhSf8eEbdWrgwAGqCO1SusIweAGapj9UrRII+I/4iIIyX7BIDMZv28Tol7rQDAzHW75e+vcjZOrQBAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRHkANAcgQ5ACRXOchtH7L9XdvHbD9t+84ShQEAprO/QB+nJH0iIh63fZGko7YfiYifFOgbALCNyiPyiPh5RDw++flXko5JurRqvwCA6RQ9R267LelqSY9t8lnP9tD2cDQaldwtAOxpxYLc9pslfVXSXRHxy42fR0Q/IjoR0VlaWiq1WwDY84oEue03aBzig4h4uESfAIDplFi1Ykn/IOlYRHy2ekkAgJ0oMSK/TtJHJV1v+8nJrw8W6BcAMIXKyw8j4j8luUAtAIBd4MpOAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5AhyAEiOIAeA5IoEue2bbD9r+znb95boEwAwncpBbnufpC9K+oCkKyV9xPaVVfsFAEynxIj8WknPRcTzEfGapAcl3VKgXwDAFEoE+aWSXjjr/clJ2zls92wPbQ9Ho1GB3QIApDJB7k3a4nUNEf2I6EREZ2lpqcBuAQBSmSA/KenQWe8PSvpZgX4BAFMoEeQ/lPQ7ti+z/UZJH5b0LwX6BQBMYX/VDiLilO2PS/q2pH2SHoiIpytXBgCYSuUgl6SI+Kakb5boCwCwM1zZCQDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJEeQAkBxBDgDJVQpy25+x/YztH9n+mu2LC9UFAJhS1RH5I5LeERFXSfqppPuqlwQA2IlKQR4R34mIU5O335d0sHpJAICdKHmO/GOSvrXVh7Z7toe2h6PRqOBuAWBv27/dBrYflXTJJh8tR8TXJ9ssSzolabBVPxHRl9SXpE6nE7uqFgDwOtsGeUTceL7Pbd8m6YikGyKCgAaAmm0b5Odj+yZJfy3pPRGxVqYkAMBOVD1H/gVJF0l6xPaTtr9UoCYAwA5UGpFHxG+XKgQAsDtc2QkAyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJAcQQ4AyRHkAJBckSC3fbftsH2gRH8AgOlVDnLbhyS9T9KJ6uUAAHaqxIj8c5LukRQF+gIA7FClILd9s6QXI+KpKbbt2R7aHo5Goyq7BQCcZf92G9h+VNIlm3y0LOmTkt4/zY4ioi+pL0mdTofROwAUsm2QR8SNm7XbfqekyyQ9ZVuSDkp63Pa1EfFS0SoBAFvaNsi3EhH/Jent6+9tH5fUiYhXCtQFAJgS68gBILliQR4R7bmPxgcDqd2WFhbGr4PBXMsBgDo0Z0Q+GEi9nrS6KkWMX3u9+YQ5XygAatScIF9eltbWzm1bWxu31+lC+kIBsCc0J8hPbHFh6Vbts3KhfKEA2DOaE+SHD++sfVYulC8UAHtGc4J8ZUVaXDy3bXFx3F6nC+ULBcCe0Zwg73alfl9629vOtL3pTfXXcaF8oQDYM5oT5Ot+/eszP//iF/VPNK5/obRakj1+7ffH7QAwA46o/7YnnU4nhsNh+Y7b7fEqkY1aLen48fL7A4Aa2T4aEZ2N7c0akW8yoTiQ1F5d1cLCgtrttgYsAwTQMM0K8g0TigNJPUmrkiJCq6ur6vV6hDmARskT5NNcLblhonFZ0oYV3VpbW9Mya7oBNMiu735Yq/WrJdcvtFm/WlI6dxJx/eflZenECZ3Y4vz/CdZ0A2iQHCPynVwt2e2OJzZPn9bhVmvT7g6zphtAg+QI8l1eLbmysqLFDWu6FxcXtcKabgANkiPId3m1ZLfbVb/fV6vVkm21Wi31+311WdMNoEFyrCPfeI5cGk9qcqENgD0k9zpyrpYEgC3lWLUijUOb4AaA18kxIgcAbIkgB4DkCHIASK5ykNu+w/aztp+2/ekSRQEApldpstP2eyXdIumqiHjV9tvLlAUAmFbVEfntkj4VEa9KUkS8XL0kAMBOVA3yKyS92/Zjtr9n+5qtNrTdsz20PRyNRhV3CwBYt+2pFduPSrpkk4+WJ7//rZLeJekaSQ/Zvjw2uVw0IvqS+tL4ys4qRQMAztg2yCPixq0+s327pIcnwf0D26clHZDEkBsAalL11Mo/S7pekmxfIemNkl6p2CcAYAeqBvkDki63/WNJD0q6bbPTKrWY5glCANBAlZYfRsRrkm4tVMvuTfsEIQBooGZc2bmTJwgBQMM0I8h3+QQhAGiCZgT5Lp8gBABN0IwgX1kZPzHobIuL43YAaLhmBDlPEAKwh+V5QtB2eIIQgD2qGSNyANjDCHIASI4gB4DkCHIASI4gB4DkPI97XNkeSVqt0MUBNfsui00+Po4tryYfX5Zja0XE0sbGuQR5VbaHEdGZdx2z0uTj49jyavLxZT82Tq0AQHIEOQAklzXI+/MuYMaafHwcW15NPr7Ux5byHDkA4IysI3IAwARBDgDJpQ5y23fYftb207Y/Pe96SrN9t+2wfWDetZRk+zO2n7H9I9tfs33xvGuqyvZNk7+Lz9m+d971lGT7kO3v2j42+bd257xrKs32PttP2P7GvGvZjbRBbvu9km6RdFVE/L6kv51zSUXZPiTpfZKa+Ly6RyS9IyKukvRTSffNuZ5KbO+T9EVJH5B0paSP2L5yvlUVdUrSJyLi9yS9S9JfNez4JOlOScfmXcRupQ1ySbdL+lREvCpJEfHynOsp7XOS7pHUuNnoiPhORJyavP2+pIPzrKeAayU9FxHPR8Rrkh7UeJDRCBHx84h4fPLzrzQOvEvnW1U5tg9K+pCk++ddy25lDvIrJL3b9mO2v2f7mnkXVIrtmyW9GBFPzbuWGnxM0rfmXURFl0p64az3J9WgoDub7bakqyU9NudSSvq8xoOm03OuY9cu6CcE2X5U0iWbfLSsce1v1fi/etdIesj25ZFkPeU2x/ZJSe+vt6Kyznd8EfH1yTbLGv+3fVBnbTPgTdpS/D3cCdtvlvRVSXdFxC/nXU8Jto9Iejkijtr+4zmXs2sXdJBHxI1bfWb7dkkPT4L7B7ZPa3zjm1Fd9VWx1bHZfqekyyQ9ZVsan3Z43Pa1EfFSjSVWcr4/O0myfZukI5JuyPLlex4nJR066/1BST+bUy0zYfsNGof4ICIennc9BV0n6WbbH5T0G5LeYvvLEXHrnOvakbQXBNn+S0m/FRF/Y/sKSf8m6XADQuEcto9L6kREhjuzTcX2TZI+K+k9EZHii/d8bO/XeNL2BkkvSvqhpD+LiKfnWlghHo8o/knSf0fEXXMuZ2YmI/K7I+LInEvZscznyB+QdLntH2s8uXRb00K8wb4g6SJJj9h+0vaX5l1QFZOJ249L+rbGE4EPNSXEJ66T9FFJ10/+vJ6cjGBxgUg7IgcAjGUekQMARJADQHoEOQAkR5ADQHIEOQAkR5ADQHIEOQAk93+igTL51gL1hQAAAABJRU5ErkJggg==",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "# Plot the data points with different colors based on their cluster assignments\n",
    "colors = ['r', 'b']\n",
    "for i in range(kmeans.k):\n",
    "    plt.scatter(X[np.where(np.array(cluster_assignments) == i)][:,0], \n",
    "                X[np.where(np.array(cluster_assignments) == i)][:,1], \n",
    "                color=colors[i])\n",
    "\n",
    "# Plot the centroids as black circles\n",
    "plt.scatter(kmeans.centroids[:,0], kmeans.centroids[:,1], color='black', marker='o')\n",
    "\n",
    "# Show the plot\n",
    "plt.show()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "69fc2d74",
   "metadata": {},
   "source": [
    "### Optimization \n",
    "Here are some ways to optimize the k-means clustering algorithm:\n",
    "\n",
    "Random initialization of centroids: Instead of initializing the centroids using the first k data points, we can randomly initialize them to improve the convergence of the algorithm. This can be done by selecting k random data points from the input dataset as the initial centroids.\n",
    "\n",
    "Early stopping: We can stop the k-means algorithm if the cluster assignments and centroids do not change after a certain number of iterations. This helps to avoid unnecessary computation.\n",
    "\n",
    "Vectorization: We can use numpy arrays and vectorized operations to speed up the computation. This avoids the need for loops and makes the code more efficient.\n",
    "\n",
    "Here's an optimized version of the k-means clustering algorithm that implements these optimizations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "121e7b70",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "class KMeans:\n",
    "    def __init__(self, k=3, max_iters=100, tol=1e-4):\n",
    "        self.k = k\n",
    "        self.max_iters = max_iters\n",
    "        self.tol = tol\n",
    "    \n",
    "    def fit(self, X):\n",
    "        # Initialize centroids randomly\n",
    "        self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]\n",
    "        \n",
    "        # Iterate until convergence or maximum number of iterations is reached\n",
    "        for i in range(self.max_iters):\n",
    "            # Assign each data point to the closest centroid\n",
    "            distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)\n",
    "            cluster_assignments = np.argmin(distances, axis=1)\n",
    "            \n",
    "            # Update the centroids based on the new cluster assignments\n",
    "            new_centroids = np.array([np.mean(X[np.where(cluster_assignments == j)], axis=0) \n",
    "                                      for j in range(self.k)])\n",
    "            \n",
    "            # Check for convergence\n",
    "            if np.linalg.norm(new_centroids - self.centroids) < self.tol:\n",
    "                break\n",
    "                \n",
    "            self.centroids = new_centroids\n",
    "    \n",
    "    def predict(self, X):\n",
    "        # Assign each data point to the closest centroid\n",
    "        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)\n",
    "        cluster_assignments = np.argmin(distances, axis=1)\n",
    "        \n",
    "        return cluster_assignments\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0a8514c5",
   "metadata": {},
   "source": [
    "This optimized version initializes the centroids randomly, uses vectorized operations for computing distances and updating the centroids, and checks for convergence after each iteration to stop the algorithm if it has converged."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a98d4ac5",
   "metadata": {},
   "source": [
    "Follow ups:\n",
    "\n",
    "* Computattional complexity: O(it * knd)\n",
    "* Improve space: use index instead of copy\n",
    "* Improve time: \n",
    "  * dim reduction\n",
    "  * subsample (cons?)\n",
    "* mini-batch\n",
    "* k-median https://mmuratarat.github.io/2019-07-23/kmeans_from_scratch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a756163a",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
